Convex Tensor Decomposition via Structured Schatten Norm Regularization



Abstract 

We discuss structured Schatten norms for tensor decomposition, which include two recently proposed norms ("overlapped" and "latent") for convex-optimization-based tensor decomposition, and connect tensor decomposition with the wider literature on structured sparsity. Based on the properties of the structured Schatten norms, we mathematically analyze the performance of the "latent" approach for tensor decomposition, which was empirically found to perform better than the "overlapped" approach in some settings. We show theoretically that this is indeed the case. In particular, when the unknown true tensor is low-rank in a specific mode, this approach performs as well as knowing the mode with the smallest rank. Along the way, we show a novel duality result for structured Schatten norms, establish the consistency, and discuss the identifiability of this approach. We confirm through numerical simulations that our theory precisely predicts the scaling behaviour of the mean squared error.



1. Introduction 

Decomposition of tensors (Kolda & Bader, 2009) (or multi-way arrays) into low-rank components arises naturally in many real-world data analysis problems. For example, in neuroimaging, we are often interested in finding spatio-temporal patterns of neural activities that are related to certain experimental conditions or subjects; one way to do this is to compute the decomposition of the data tensor, which can be of size channels × time-points × subjects × conditions (Mørup, 2011). In computer vision, an ensemble of face images can be collected into a tensor of size pixels × subjects × illumination × viewpoints; the decomposition of this tensor yields the so-called tensorfaces (Vasilescu & Terzopoulos, 2002), which can be regarded as a multilinear generalization of eigenfaces (Sirovich & Kirby, 1987).

Figure 1. Schematic illustrations of the overlapped approach and the latent approach for the decomposition of a three-way tensor (K = 3). (a) Overlapped approach. (b) Latent approach.

Conventionally, tensor decomposition has been tackled through non-convex optimization problems, using alternating least squares or higher-order orthogonal iteration (De Lathauwer et al., 2000). Although these approaches have been successful in many application areas, their statistical performance remains largely an open question. Moreover, the model selection problem can be highly challenging, especially for the so-called Tucker model (Tucker, 1966; De Lathauwer et al., 2000), because we need to specify the rank $r_k$ for each mode (here a mode refers to one dimensionality of a tensor); that is, we have K hyper-parameters to choose for a K-way tensor, which is challenging even for K = 3.

Recently a convex-optimization-based approach for 
tensor decomposition has been proposed by several au- 
thors (Signoretto et al., 2010; Gandy et al., 2011; Liu 
et al., 2009; Tomioka et al., 2011a), and its perfor- 
mance has been analyzed in (Tomioka et al., 2011b). 

The basic idea behind their convex approach, which we call the overlapped approach, is to unfold¹ a tensor into matrices along different modes and penalize the unfolded matrices to be simultaneously low-rank based on the Schatten 1-norm, which is also known as the trace norm and nuclear norm (Fazel et al., 2001; Srebro et al., 2005; Recht et al., 2010); see the left panel of Figure 1. The convex approach does not require the rank of the decomposition to be specified beforehand, and due to the low-rank inducing property of the Schatten 1-norm, the rank of the decomposition is automatically determined.

¹ For a K-way tensor, there are K ways to unfold a tensor into a matrix. See Section 2.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

Figure 2. Estimation of a low-rank 50 × 50 × 20 tensor of rank r × r × 3 from noisy measurements. The noise standard deviation is σ = 0.1. The estimation errors of two convex-optimization-based methods are plotted against the rank r of the first two modes. The solid lines show the error at the fixed regularization constant λ, which is 0.89 for the overlapped approach and 3.79 for the latent approach (see also Figure 3). The dashed lines show the minimum error over candidates of the regularization constant λ from 0.1 to 100. In the inset, the errors of the two approaches are plotted against the regularization constant λ for rank r = 40 (marked with a gray dashed vertical line in the main plot). The two values (0.89 and 3.79) are marked with vertical dashed lines. Note that both approaches need no knowledge of the true rank; the rank is automatically learned.

However, it has been noticed that the above overlapped approach has a limitation: it performs poorly for a tensor that is low-rank only in a certain mode (Tomioka et al., 2011a). They proposed an alternative approach, which we call the latent approach, that decomposes a given tensor into a mixture of tensors, each of which is low-rank in a specific mode; see the right panel of Figure 1. Figure 2 demonstrates that the latent approach is preferable to the overlapped approach when the underlying tensor is almost full rank in all but one mode.

However, two issues have not been properly addressed so far.

The first issue is the statistical performance of the latent approach. In this paper, we show that the mean squared error of the latent approach scales no greater than the minimum mode-k rank of the underlying true tensor, which clearly explains why the latent approach suffers less than the overlapped approach in Figure 2.

The second issue is the identifiability of the model un- 
derlying the latent approach, i.e., a mixture of low- 
rank tensors. In this paper, we show that such a mix- 
ture is identifiable only when the mixture consists of 
one component; in other words, when the underlying 
tensor is low-rank in a specific mode. 

Along the way, we show a novel duality between the two types of norms employed in the above two approaches, namely the overlapped Schatten norm and the latent Schatten norm. This result is closely related to, and generalizes, the results in the structured sparsity literature (Bach et al., 2011; Jenatton et al., 2011; Obozinski et al., 2011; Maurer & Pontil, 2011). In fact, the (plain) overlapped group lasso constrains the weights to be simultaneously group sparse over overlapping groups. The latent group lasso predicts with a mixture of group sparse weights (see also Wright et al., 2010; Jalali et al., 2010; Agarwal et al., 2011). These approaches clearly correspond to the two variations of tensor decomposition algorithms we discussed above.

Finally we empirically compare the overlapped ap- 
proach and latent approach and show that even when 
the unknown tensor is simultaneously low-rank, which 
is a favorable situation for the overlapped approach, 
the latent approach performs better in many cases. 
Thus we provide both theoretical and empirical ev- 
idence that for noisy tensor decomposition, the la- 
tent approach is preferable to the overlapped ap- 
proach. Our result is complementary to the previous 
study (Tomioka et al., 2011a;b), which mainly focused 
on the noise-less tensor completion setting. 

This paper is structured as follows. In Section 2, we provide basic definitions of the two variations of structured Schatten norms, namely the overlapped/latent Schatten norms, and discuss their properties, especially the duality between them. Section 3 presents our main theoretical contributions; we establish the consistency of the latent approach, we show a denoising performance bound, and discuss the identifiability of the model underlying it. In Section 4, we empirically confirm the scaling predicted by our theory. Finally, Section 5 concludes the paper.

2. Structured Schatten norms for tensors

In this section, we define the overlapped Schatten 
norm and the latent Schatten norm and discuss their 
basic properties. 

First we need some basic definitions. 



Let $\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ be a K-way tensor. We denote the total number of entries in $\mathcal{W}$ by $N = \prod_{k=1}^{K} n_k$. The dot product between two tensors $\mathcal{W}$ and $\mathcal{X}$ is defined as $\langle \mathcal{W}, \mathcal{X} \rangle = \mathrm{vec}(\mathcal{W})^\top \mathrm{vec}(\mathcal{X})$; i.e., the dot product as vectors in $\mathbb{R}^N$. The Frobenius norm of a tensor is defined as $|||\mathcal{W}|||_F = \sqrt{\langle \mathcal{W}, \mathcal{W} \rangle}$. Each dimensionality of a tensor is called a mode. The mode-k unfolding $W_{(k)} \in \mathbb{R}^{n_k \times N/n_k}$ is a matrix that is obtained by concatenating the mode-k fibers along columns; here a mode-k fiber is an $n_k$-dimensional vector obtained by fixing all the indices but the kth index of $\mathcal{W}$. The mode-k rank $r_k$ of $\mathcal{W}$ is the rank of the mode-k unfolding $W_{(k)}$. We say that a tensor $\mathcal{W}$ has Tucker rank $(r_1, \ldots, r_K)$ if the mode-k rank is $r_k$ for $k = 1, \ldots, K$ (Kolda & Bader, 2009). The mode-k folding is the inverse of the unfolding operation.
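The definitions above can be sketched in a few lines of NumPy. This is only an illustration: the helper names `unfold`, `fold`, and `tucker_rank` are our own, modes are indexed from zero, and the column ordering of the unfolding follows NumPy's row-major layout, which may differ from the convention of Kolda & Bader (2009) by a column permutation (ranks and Schatten norms are unaffected by such a permutation).

```python
import numpy as np

def unfold(W, k):
    """Mode-k unfolding: an n_k x (N / n_k) matrix whose columns are mode-k fibers."""
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def fold(M, k, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from its mode-k unfolding."""
    rest = [n for i, n in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape([shape[k]] + rest), 0, k)

def tucker_rank(W):
    """Tuple of mode-k ranks (r_1, ..., r_K), computed numerically."""
    return tuple(int(np.linalg.matrix_rank(unfold(W, k))) for k in range(W.ndim))

rng = np.random.default_rng(0)
# Hypothetical example: a 5 x 6 x 7 tensor built from a 2 x 3 x 4 core tensor.
core = rng.standard_normal((2, 3, 4))
U = [rng.standard_normal((n, r)) for n, r in zip((5, 6, 7), core.shape)]
W = np.einsum("abc,ia,jb,kc->ijk", core, *U)
print(tucker_rank(W))
```

For a generic core of this size, the Tucker rank of the example tensor coincides with the shape of the core.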

2.1. Overlapped Schatten norms 

The low-rank inducing norm studied in (Signoretto et al., 2010; Gandy et al., 2011; Liu et al., 2009; Tomioka et al., 2011a), which we call the overlapped Schatten 1-norm, can be written as follows:

$$|||\mathcal{W}|||_{\underline{S_1/1}} = \sum_{k=1}^{K} \|W_{(k)}\|_{S_1}. \qquad (1)$$

In this paper, we consider the following more general overlapped $S_p/q$-norm, which includes the Schatten 1-norm as the special case (p, q) = (1, 1). The overlapped $S_p/q$-norm is written as follows:

$$|||\mathcal{W}|||_{\underline{S_p/q}} = \Bigl( \sum_{k=1}^{K} \|W_{(k)}\|_{S_p}^{q} \Bigr)^{1/q}, \qquad (2)$$

where $1 \le p, q \le \infty$; here

$$\|W\|_{S_p} = \Bigl( \sum_{j=1}^{r} \sigma_j^{p}(W) \Bigr)^{1/p}$$

is the Schatten p-norm for matrices, where $\sigma_j(W)$ is the jth largest singular value of $W$.

When used as a regularizer, the overlapped Schatten 1-norm penalizes all modes of $\mathcal{W}$ to be jointly low-rank. It is related to the overlapped group regularization (see Jenatton et al., 2011; Mairal et al., 2011) in the sense that the same object $\mathcal{W}$ appears repeatedly in the norm.
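As a minimal sketch of definition (2), the overlapped norm can be evaluated by taking the Schatten p-norm of every unfolding and then the ℓ_q norm of the resulting K numbers. The function names below are our own and the zero-based mode index is a NumPy convention, not part of the paper.

```python
import numpy as np

def unfold(W, k):
    """Mode-k unfolding (zero-based k) as an n_k x (N / n_k) matrix."""
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def schatten(M, p):
    """Schatten p-norm of a matrix: the l_p norm of its singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.linalg.norm(s, ord=p))

def overlapped_schatten(W, p=1, q=1):
    """Overlapped S_p/q norm: l_q norm of the Schatten p-norms of all K unfoldings."""
    norms = [schatten(unfold(W, k), p) for k in range(W.ndim)]
    return float(np.linalg.norm(norms, ord=q))
```

Passing `np.inf` for `p` and `q` gives the overlapped $S_\infty/\infty$-norm, i.e., the largest spectral norm among the unfoldings.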



The following inequality relates the overlapped Schatten 1-norm with the Frobenius norm, which was a key step in the analysis of Tomioka et al. (2011b):

$$|||\mathcal{W}|||_{\underline{S_1/1}} \le \sum_{k=1}^{K} \sqrt{r_k}\, |||\mathcal{W}|||_F, \qquad (3)$$

where $r_k$ is the mode-k rank of $\mathcal{W}$.
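One way to see why this inequality holds, sketched here as a standard argument rather than as the derivation given by Tomioka et al. (2011b), is to apply the Cauchy-Schwarz inequality to the $r_k$ nonzero singular values of each unfolding and to use the fact that every unfolding has the same Frobenius norm as the tensor itself:

```latex
\|W_{(k)}\|_{S_1}
  = \sum_{j=1}^{r_k} \sigma_j(W_{(k)})
  \le \sqrt{r_k}\,\Bigl(\sum_{j=1}^{r_k} \sigma_j^2(W_{(k)})\Bigr)^{1/2}
  = \sqrt{r_k}\,\|W_{(k)}\|_F
  = \sqrt{r_k}\,|||\mathcal{W}|||_F ,
\qquad
|||\mathcal{W}|||_{\underline{S_1/1}}
  = \sum_{k=1}^{K} \|W_{(k)}\|_{S_1}
  \le \sum_{k=1}^{K} \sqrt{r_k}\,|||\mathcal{W}|||_F .
```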



Now we are interested in the dual norm of the overlapped $S_p/q$-norm, because deriving the dual norm is a key step in solving the minimization problem that involves the norm (2) (see Mairal et al., 2011), as well as in computing various complexity measures, such as Rademacher complexity (Foygel & Srebro, 2011) and Gaussian width (Chandrasekaran et al., 2010). It turns out that the dual norm of the overlapped $S_p/q$-norm is the latent $S_{p^*}/q^*$-norm, as shown in the following lemma.

Lemma 1. The dual norm of the overlapped $S_p/q$-norm is the latent $S_{p^*}/q^*$-norm, where $1/p + 1/p^* = 1$ and $1/q + 1/q^* = 1$, which is defined as follows:

$$|||\mathcal{X}|||_{\overline{S_{p^*}/q^*}} = \Bigl( \inf_{(\mathcal{X}^{(1)} + \cdots + \mathcal{X}^{(K)}) = \mathcal{X}} \sum_{k=1}^{K} \|X^{(k)}_{(k)}\|_{S_{p^*}}^{q^*} \Bigr)^{1/q^*}. \qquad (4)$$

Here the infimum is taken over the K-tuple of tensors $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(K)}$ that sums to $\mathcal{X}$.

Proof. The proof is presented in Appendix A. □

The duality in the above lemma naturally generalizes the duality between overlapped/latent group sparsity norms that have only partial overlap (in contrast to the complete overlap here). Although it has been recognized in special instances (Jalali et al., 2010; Obozinski et al., 2011; Maurer & Pontil, 2011; Agarwal et al., 2011), to the best of our knowledge, this duality has not been presented in the generality of Lemma 1. Note that when the groups have no overlap, the overlapped/latent group sparsity norms become identical, and the duality is the ordinary duality between the group $S_p/q$-norms and the group $S_{p^*}/q^*$-norms.

2.2. Latent Schatten norms 

The latent approach for tensor decomposition proposed by Tomioka et al. (2011a) solves the following minimization problem

$$\min_{\mathcal{W}^{(1)}, \ldots, \mathcal{W}^{(K)}} \; L\bigl(\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)}\bigr) + \lambda \sum_{k=1}^{K} \|W^{(k)}_{(k)}\|_{S_1}, \qquad (5)$$

where $L$ is a loss function, $\lambda$ is a regularization constant, and $W^{(k)}_{(k)}$ is the mode-k unfolding of $\mathcal{W}^{(k)}$. Intuitively speaking, the latent approach for tensor decomposition predicts with a mixture of K tensors, each of which is regularized to be low-rank in a specific mode.

Now, since the loss term in the minimization problem (5) only depends on the sum of the tensors $\mathcal{W}^{(1)}, \ldots, \mathcal{W}^{(K)}$, minimization problem (5) is equivalent to the following minimization problem

$$\min_{\mathcal{W}} \; L(\mathcal{W}) + \lambda\, |||\mathcal{W}|||_{\overline{S_1/1}}.$$

In other words, we have identified the structured Schatten norm employed in the latent approach as the latent $S_1/1$-norm (or latent Schatten 1-norm for short), which can be written as follows:

$$|||\mathcal{W}|||_{\overline{S_1/1}} = \inf_{(\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)}) = \mathcal{W}} \sum_{k=1}^{K} \|W^{(k)}_{(k)}\|_{S_1}. \qquad (6)$$

According to Lemma 1, the dual norm of the latent $S_1/1$-norm is the overlapped $S_\infty/\infty$-norm

$$|||\mathcal{X}|||_{\underline{S_\infty/\infty}} = \max_{k} \|X_{(k)}\|_{S_\infty}, \qquad (7)$$

where $\|\cdot\|_{S_\infty}$ is the spectral norm.

The following lemma is similar to inequality (3) and is key to our analysis.

Lemma 2.

$$|||\mathcal{W}|||_{\overline{S_1/1}} \le \Bigl( \min_{k} \sqrt{r_k} \Bigr) |||\mathcal{W}|||_F,$$

where $r_k$ is the mode-k rank of $\mathcal{W}$.

Proof. Since we are allowed to take a singleton decomposition $\mathcal{W}^{(k)} = \mathcal{W}$ and $\mathcal{W}^{(k')} = 0$ ($k' \ne k$), we have

$$|||\mathcal{W}|||_{\overline{S_1/1}} = \inf_{(\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)}) = \mathcal{W}} \sum_{k=1}^{K} \|W^{(k)}_{(k)}\|_{S_1} \le \|W_{(k)}\|_{S_1} \le \sqrt{r_k}\, |||\mathcal{W}|||_F \quad (\forall k = 1, \ldots, K).$$

Choosing k that minimizes the right hand side, we obtain our claim. □

Compared to inequality (3), the latent Schatten 1-norm is bounded by the minimal square root of the ranks instead of the sum. This is the fundamental reason why the latent approach performs better than the overlapped approach as in Figure 2.

3. Main theoretical results

In this section, we study the consistency, generaliza- 
tion performance, and identifiability of the latent ap- 
proach for tensor decomposition in the context of re- 
covering an unknown tensor W* from noisy measure- 
ments. This is the setting of the experiment in Fig- 
ure 2. 

First, we show that the latent approach is consistent. 
That is, the error goes to zero when the noise goes 
to zero, which corresponds to the situation when the 
entries are repeatedly observed. 

Second, combining the duality we presented in the previous section with the techniques from Agarwal et al. (2011), we analyze the denoising performance of the latent approach in the context of recovering an unknown tensor $\mathcal{W}^*$ from noisy measurements. This is the setting of the experiment in Figure 2. We first prove a deterministic inequality that holds under a certain condition on the regularization constant. Next, we assume Gaussian noise and derive an inequality that holds with high probability under an appropriate scaling of the regularization constant.

Third, we discuss the difference between the overlapped approach and the latent approach and provide an explanation for the empirically observed superior performance of the latent approach in Figure 2.

Finally, we discuss the condition under which the decomposition $\mathcal{W} = \sum_{k=1}^{K} \mathcal{W}^{(k)}$ is identifiable and show that the model is (locally) identifiable only when the mixture consists of one component.

3.1. Consistency 

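Let $\mathcal{W}^*$ be the underlying true tensor, and let the noisy version $\mathcal{Y}$ be obtained as follows:

$$\mathcal{Y} = \mathcal{W}^* + \mathcal{E},$$

where $\mathcal{E} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ is the noise tensor.

First we establish the consistency of the latent approach.

Theorem 1. The estimator defined by

$$\hat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \; \frac{1}{2} |||\mathcal{Y} - \mathcal{W}|||_F^2 + \lambda\, |||\mathcal{W}|||_{\overline{S_1/1}} \qquad (8)$$

is consistent. That is, when the noise goes to zero (e.g., when the entries are repeatedly observed), $\hat{\mathcal{W}} \to \mathcal{W}^*$ for any sequence $\lambda \to 0$.

Proof. Due to the triangular inequality,

$$|||\hat{\mathcal{W}} - \mathcal{W}^*|||_F \le |||\hat{\mathcal{W}} - \mathcal{Y}|||_F + |||\mathcal{Y} - \mathcal{W}^*|||_F.$$

Here the second term goes to zero as the noise shrinks. Next, from the optimality of $\hat{\mathcal{W}}$, the first term satisfies

$$\mathcal{Y} - \hat{\mathcal{W}} \in \lambda\, \partial |||\hat{\mathcal{W}}|||_{\overline{S_1/1}},$$

where $\partial |||\hat{\mathcal{W}}|||_{\overline{S_1/1}}$ is the subdifferential of the latent $S_1/1$ norm at $\hat{\mathcal{W}}$. Now since the dual norm of the latent $S_1/1$ norm is the overlapped $S_\infty/\infty$ norm, for any $\mathcal{G} \in \partial |||\hat{\mathcal{W}}|||_{\overline{S_1/1}}$ we have $|||\mathcal{G}|||_{\underline{S_\infty/\infty}} \le 1$, and therefore

$$|||\hat{\mathcal{W}} - \mathcal{Y}|||_F \le C\, |||\hat{\mathcal{W}} - \mathcal{Y}|||_{\underline{S_\infty/\infty}} \le C \lambda,$$

where $C$ is a constant that is independent of $\lambda$. Therefore, for any sequence $\lambda \to 0$, we have $\hat{\mathcal{W}} \to \mathcal{W}^*$ when $\mathcal{E} \to 0$. □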
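To make the estimator (8) concrete, the following is a minimal sketch that minimizes the equivalent component-wise objective, $\frac{1}{2}|||\mathcal{Y} - \sum_k \mathcal{W}^{(k)}|||_F^2 + \lambda \sum_k \|W^{(k)}_{(k)}\|_{S_1}$, by cyclic block coordinate descent. This is a technique we substitute for illustration; it is not the ADMM solver used in the experiments of Section 4. Each block update is an exact minimization obtained by soft-thresholding the singular values of the mode-k unfolding of the current residual. All function names and the iteration count are our own assumptions.

```python
import numpy as np

def unfold(W, k):
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def fold(M, k, shape):
    rest = [n for i, n in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape([shape[k]] + rest), 0, k)

def svt(M, tau):
    """Soft-threshold the singular values of M by tau (prox of tau * Schatten 1-norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(Y, comps, lam):
    """0.5 * |||Y - sum_k W^(k)|||_F^2 + lam * sum_k ||W^(k)_(k)||_{S_1}."""
    fit = 0.5 * np.sum((Y - sum(comps)) ** 2)
    pen = sum(np.linalg.svd(unfold(C, k), compute_uv=False).sum()
              for k, C in enumerate(comps))
    return fit + lam * pen

def latent_denoise(Y, lam, n_iter=50):
    """Cyclic block coordinate descent over the K latent components."""
    K = Y.ndim
    comps = [np.zeros_like(Y) for _ in range(K)]
    for _ in range(n_iter):
        for k in range(K):
            # Residual that the k-th component has to explain.
            R = Y - sum(comps[j] for j in range(K) if j != k)
            # Exact minimizer over W^(k) with the other components held fixed.
            comps[k] = fold(svt(unfold(R, k), lam), k, Y.shape)
    return sum(comps), comps
```

Because every block update leaves a residual whose mode-k singular values are at most $\lambda$, the fit $|||\hat{\mathcal{W}} - \mathcal{Y}|||_F$ produced by this sketch shrinks with $\lambda$, which mirrors the last step of the proof above.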

3.2. Deterministic bound 

The consistency statement in the previous section only deals with the sum $\hat{\mathcal{W}} = \sum_{k=1}^{K} \hat{\mathcal{W}}^{(k)}$ and its convergence to the truth $\mathcal{W}^*$ in the limit where the noise goes to zero. In this section, we establish a stronger statement that shows the behavior of the individual terms $\hat{\mathcal{W}}^{(k)}$ and also the denoising performance.

To this end we need some additional assumptions.

First, we assume that the unknown tensor $\mathcal{W}^*$ is a mixture of K tensors, each of which is low-rank in a certain mode, and we have a noisy observation $\mathcal{Y}$ as follows:

$$\mathcal{Y} = \mathcal{W}^* + \mathcal{E} = \sum_{k=1}^{K} \mathcal{W}^{*(k)} + \mathcal{E}, \qquad (9)$$

where $\bar{r}_k = \mathrm{rank}(W^{*(k)}_{(k)})$ is the mode-k rank of the kth component $\mathcal{W}^{*(k)}$.

Second, we assume that the spectral norm of the mode-k unfolding of the lth component is bounded by a constant $\alpha$ for all $k \ne l$ as follows:

$$\|W^{*(l)}_{(k)}\|_{S_\infty} \le \alpha \quad (\forall l \ne k,\; k, l = 1, \ldots, K). \qquad (10)$$

Note that such an additional incoherence assumption has also been used in (Candès et al., 2009; Wright et al., 2010; Agarwal et al., 2011; Hsu et al., 2011).

We employ the following optimization problem to recover the unknown tensor $\mathcal{W}^*$:

$$\hat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \Bigl( \frac{1}{2} |||\mathcal{Y} - \mathcal{W}|||_F^2 + \lambda\, |||\mathcal{W}|||_{\overline{S_1/1}} \Bigr) \quad \text{s.t.} \quad \|W^{(l)}_{(k)}\|_{S_\infty} \le \alpha \quad (\forall l \ne k), \qquad (11)$$

where $\mathcal{W} = \sum_{k=1}^{K} \mathcal{W}^{(k)}$ denotes the optimal decomposition induced by the latent Schatten 1-norm (6); $\lambda > 0$ is a regularization constant. Notice that we have introduced additional spectral norm constraints to control the correlation between the components (see also Agarwal et al., 2011).

Our first bound can be stated as follows: 

Theorem 2. Let $\hat{\mathcal{W}}^{(k)}$ be an optimal decomposition of $\hat{\mathcal{W}}$ induced by the latent Schatten 1-norm (6). Assume that the regularization constant $\lambda$ satisfies $\lambda \ge 2\, |||\mathcal{E}|||_{\underline{S_\infty/\infty}} + \alpha (K - 1)$. Then there is a universal constant $c$ such that any solution $\hat{\mathcal{W}}$ of the minimization problem (11) satisfies the following deterministic bound:

$$\sum_{k=1}^{K} |||\hat{\mathcal{W}}^{(k)} - \mathcal{W}^{*(k)}|||_F^2 \le c \lambda^2 \sum_{k=1}^{K} \bar{r}_k. \qquad (12)$$

Proof. The proof is presented in Appendix B. □



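We can also obtain a bound on the difference of the whole tensor $\hat{\mathcal{W}} - \mathcal{W}^*$, rather than on the sum of squared differences of the components as in Theorem 2, as follows.

Corollary 1. Under the same conditions as in Theorem 2 we have

$$|||\hat{\mathcal{W}} - \mathcal{W}^*|||_F^2 \le c K \lambda^2 \sum_{k=1}^{K} \bar{r}_k. \qquad (13)$$

Proof. Using the triangular inequality and the Cauchy-Schwarz inequality we have

$$|||\hat{\mathcal{W}} - \mathcal{W}^*|||_F^2 \le \Bigl( \sum_{k=1}^{K} |||\hat{\mathcal{W}}^{(k)} - \mathcal{W}^{*(k)}|||_F \Bigr)^2 \le K \sum_{k=1}^{K} |||\hat{\mathcal{W}}^{(k)} - \mathcal{W}^{*(k)}|||_F^2.$$

□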



Since we are bounding the overall error in (13), we may exploit the arbitrariness of the decomposition $\mathcal{W}^* = \sum_{k=1}^{K} \mathcal{W}^{*(k)}$ to obtain a tight bound. The tightest bound is obtained when we choose the decomposition that minimizes the sum of the ranks $\sum_{k=1}^{K} \bar{r}_k$. We say $\mathcal{W}^*$ has the latent rank $(\bar{r}_1, \ldots, \bar{r}_K)$ for such a minimal decomposition in terms of the sum.

A simple upper bound is obtained by choosing a decomposition $\mathcal{W}^{*(k)} = \mathcal{W}^*$ and $\mathcal{W}^{*(k')} = 0$ for $k' \ne k$. In particular, by choosing the mode with the minimum mode-k rank, we obtain

$$|||\hat{\mathcal{W}} - \mathcal{W}^*|||_F^2 \le c K \lambda^2 \min_{k = 1, \ldots, K} r_k,$$

where $r_k$ is the mode-k rank of $\mathcal{W}^*$. We refer to the above decomposition as the minimum rank singleton decomposition.

Note that the right-hand side of our bound (12) does not necessarily go to zero when the noise $\mathcal{E}$ goes to zero, because $\lambda \ge \alpha (K - 1)$. When the noise goes to zero, $\hat{\mathcal{W}} \to \mathcal{W}^*$ can be obtained by any decreasing sequence $\lambda \to 0$, as shown in the previous subsection. Therefore our bound is most useful when the noise is relatively large and the first term $2\, |||\mathcal{E}|||_{\underline{S_\infty/\infty}}$ dominates the second term $\alpha (K - 1)$ in the condition for the regularization constant $\lambda$.

3.3. Gaussian noise 

When the elements of the noise tensor $\mathcal{E}$ are Gaussian, we obtain the following theorem.

Theorem 3. Assume that the elements of the noise tensor $\mathcal{E}$ are independent Gaussian random variables with variance $\sigma^2$. In addition, assume without loss of generality that the dimensionalities of $\mathcal{W}^*$ are sorted in the descending order, i.e., $n_1 \ge \cdots \ge n_K$. Then there are universal constants $c_0, c_1$ such that, with high probability, any solution of the minimization problem (11) with regularization constant $\lambda = c_0 \sigma \bigl( \sqrt{N/n_K} + \sqrt{n_1} + \sqrt{\log K} \bigr) + \alpha (K - 1)$ satisfies the following bound:

$$\frac{1}{N} \sum_{k=1}^{K} |||\hat{\mathcal{W}}^{(k)} - \mathcal{W}^{*(k)}|||_F^2 \le c_1 F \sigma^2\, \frac{\sum_{k=1}^{K} \bar{r}_k}{n_K}, \qquad (14)$$

where

$$F = \Bigl( \Bigl( 1 + \sqrt{\tfrac{n_1 n_K}{N}} \Bigr) + \Bigl( \sqrt{\log K} + \tfrac{\alpha (K - 1)}{c_0 \sigma} \Bigr) \sqrt{\tfrac{n_K}{N}} \Bigr)^2$$

is a factor that mildly depends on the dimensionalities and the constant $\alpha$ in (10).

Proof. The proof is presented in Appendix C. □

Note that the theoretically optimal choice of the regularization constant $\lambda$ is independent of the Tucker/latent rank of the truth $\mathcal{W}^*$, which is unknown in practice.

Again we can obtain a bound corresponding to the minimum rank singleton decomposition as in inequality (13) as follows:

$$\frac{1}{N} |||\hat{\mathcal{W}} - \mathcal{W}^*|||_F^2 \le c_1 F K \sigma^2\, \frac{\min_k r_k}{n_K}, \qquad (15)$$

where $F$ is the same factor as in Theorem 3.



3.4. Comparison with the overlapped approach 

Inequality (15) explains the superior performance of the latent approach for tensor decomposition in Figure 2. The inequality obtained in (Tomioka et al., 2011b) for the overlapped approach that uses the overlapped Schatten 1-norm (1) can be stated as follows:

$$\frac{1}{N} |||\hat{\mathcal{W}} - \mathcal{W}^*|||_F^2 \le c_1' \sigma^2 \Bigl( \frac{1}{K} \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \Bigr)^2 \Bigl( \frac{1}{K} \sum_{k=1}^{K} \sqrt{r_k} \Bigr)^2. \qquad (16)$$

Comparing inequalities (15) and (16), we notice that the complexity of the overlapped approach depends on the average (square root) of the Tucker rank $(r_1, \ldots, r_K)$, whereas that of the latent approach grows only linearly with the minimum Tucker rank. Interestingly, the latent approach performs as if it knows the mode with the minimum rank, although such information is not available to it. However, in inequality (15) we have the factor K. This means that if the mode with the minimum rank is known, the latent approach loses by a constant factor K against the simple matrix decomposition approach that unfolds the given tensor at the minimal rank mode and performs ordinary Schatten 1-norm minimization.



3.5. Discussion on the identifiability 

Let $\bar{r}_k = \mathrm{rank}(W^{(k)}_{(k)})$ be the mode-k rank of the kth component $\mathcal{W}^{(k)}$ in the decomposition

$$\mathcal{W} = \mathcal{W}^{(1)} + \mathcal{W}^{(2)} + \cdots + \mathcal{W}^{(K)}. \qquad (17)$$

We say that a decomposition (17) is locally identifiable when there is no other decomposition $\mathcal{W}'^{(k)}$ having the same rank $(\bar{r}_1, \ldots, \bar{r}_K)$. The following theorem fully characterizes the local identifiability of the decomposition (17).

Theorem 4. The decomposition (17) is locally identifiable if and only if $\mathcal{W}^{(k)} = \mathcal{W}$ for $k = k^*$ and $\mathcal{W}^{(k)} = 0$ otherwise, for some $k^*$.

Proof. The proof is given in Appendix D. □

The above theorem partly explains the difficulty of estimating the individual components $\mathcal{W}^{*(k)}$ without an additional incoherence assumption as in (10). In fact, most decompositions of the form (9) are not identifiable.

4. Numerical results 

In this section, we numerically confirm the scaling be- 
havior we have theoretically predicted in the last sec- 
tion. 

The goal of this experiment is to recover the true low-rank tensor $\mathcal{W}^*$ from a noisy observation $\mathcal{Y}$. We randomly generated the true low-rank tensors $\mathcal{W}^*$ of size 50 × 50 × 20 or 80 × 80 × 40 with various Tucker ranks $(r_1, r_2, r_3)$. A low-rank tensor is generated by first randomly drawing the $r_1 \times r_2 \times r_3$ core tensor from the standard normal distribution and multiplying each of its modes by an orthogonal factor matrix drawn from the Haar measure. The observation tensor $\mathcal{Y}$ is obtained by adding Gaussian noise with standard deviation σ = 0.1. There are no missing entries in this experiment.
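The data generation described above might be sketched as follows. We assume that each factor is an $n_k \times r_k$ matrix with orthonormal columns, obtained from the QR decomposition of a Gaussian matrix with a sign correction, which is one standard way to sample from the Haar measure; the function names and the random seed are our own and are not taken from the authors' code.

```python
import numpy as np

def haar_orthogonal(n, r, rng):
    """n x r matrix with orthonormal columns, Haar-distributed via sign-corrected QR."""
    Q, R = np.linalg.qr(rng.standard_normal((n, r)))
    # Fix the column signs so that the distribution does not depend on the QR convention.
    return Q * np.sign(np.diag(R))

def random_low_rank_tensor(shape, ranks, rng):
    """Three-way tensor with the given Tucker rank: Gaussian core times orthonormal factors."""
    core = rng.standard_normal(ranks)
    factors = [haar_orthogonal(n, r, rng) for n, r in zip(shape, ranks)]
    return np.einsum("abc,ia,jb,kc->ijk", core, *factors, optimize=True)

rng = np.random.default_rng(0)
sigma = 0.1  # noise standard deviation used in the experiment
W_true = random_low_rank_tensor((50, 50, 20), (40, 40, 3), rng)
Y = W_true + sigma * rng.standard_normal(W_true.shape)
```

The Tucker rank (40, 40, 3) used here corresponds to the r = 40 setting highlighted in Figure 2.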

For an observation $\mathcal{Y}$, we computed tensor decompositions using the overlapped approach and the latent approach (11) using the solver available from the webpage² of one of the authors of Tomioka et al. (2011a). The solver uses the alternating direction method of multipliers (Gabay & Mercier, 1976), and the algorithm is described in the above paper. We computed the solutions for 20 candidate regularization constants ranging from 0.1 to 100 and report the results for three representative values for each method.

We measured the quality of the solutions obtained by the two approaches by the mean squared error (MSE) $|||\hat{\mathcal{W}} - \mathcal{W}^*|||_F^2 / N$. In order to make our theoretical predictions more concrete, we define the quantities in the right hand side of the bounds (16) and (14) as Tucker rank (TR) complexity and Latent rank (LR) complexity, respectively, as follows:

$$\text{TR complexity} = \Bigl( \frac{1}{K} \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \Bigr)^2 \Bigl( \frac{1}{K} \sum_{k=1}^{K} \sqrt{r_k} \Bigr)^2, \qquad (18)$$

$$\text{LR complexity} = \frac{\sum_{k=1}^{K} \bar{r}_k}{n_K}, \qquad (19)$$

where without loss of generality we assume $n_1 \ge \cdots \ge n_K$. We have ignored terms like $\sqrt{n_k / N}$ because they are negligible for $n_k \approx 50$ and $N \approx 50{,}000$. The TR complexity is equivalent to the normalized rank in (Tomioka et al., 2011b). Note that the TR complexity (18) is defined in terms of the Tucker rank $(r_1, \ldots, r_K)$ of the truth $\mathcal{W}^*$, whereas the LR complexity (19) is defined in terms of the latent rank $(\bar{r}_1, \ldots, \bar{r}_K)$ (see Section 3.2). In order to compute the sum of latent ranks $\sum_{k=1}^{K} \bar{r}_k$, we applied the latent approach to the true tensor $\mathcal{W}^*$ without noise, and took the minimum of the sums obtained from that and the minimum rank singleton decomposition. The whole procedure is repeated 10 times and averaged.
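The two complexity measures are simple to evaluate once the ranks are known. The sketch below uses our own function names and, for the latent rank, assumes the minimum rank singleton decomposition (all components zero except the one in the lowest-rank mode) rather than the decomposition found by running the latent approach.

```python
import numpy as np

def tr_complexity(shape, tucker_rank):
    """(mean_k 1/sqrt(n_k))^2 * (mean_k sqrt(r_k))^2, as in the TR complexity."""
    n = np.asarray(shape, dtype=float)
    r = np.asarray(tucker_rank, dtype=float)
    return float(np.mean(1.0 / np.sqrt(n)) ** 2 * np.mean(np.sqrt(r)) ** 2)

def lr_complexity(shape, latent_rank):
    """Sum of latent ranks divided by the smallest dimension n_K."""
    return float(np.sum(latent_rank)) / min(shape)

# Setting of Figure 2 at r = 40: size 50 x 50 x 20, Tucker rank (40, 40, 3).
shape, tucker = (50, 50, 20), (40, 40, 3)
singleton_latent = (0, 0, 3)  # assumed: only the mode-3 component is nonzero
print(tr_complexity(shape, tucker), lr_complexity(shape, singleton_latent))
```

In this setting the TR complexity is roughly 0.65 while the LR complexity is 0.15; the two values are only comparable up to the constants in the respective bounds, but the gap illustrates why the overlapped approach is penalized by the two nearly full-rank modes.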

Figure 3 shows the results of the experiment. The left panel shows the MSE of the overlapped approach against the TR complexity (18). The middle panel shows the MSE of the latent approach against the LR complexity (19). The right panel shows the improvement (i.e., the MSE of the overlapped approach divided by that of the latent approach) against the ratio of the respective complexity measures.

First, from the left panel we can confirm that, as predicted by (Tomioka et al., 2011b), the MSE of the overlapped approach scales linearly against the TR complexity (18) for each value of the regularization constant. We can also see that, as predicted by Theorem 3, by scaling the regularization constant proportionally with $\sqrt{N / n_K}$, the series corresponding to size 50 × 50 × 20 and those corresponding to size 80 × 80 × 40 almost lie on top of each other.

From the central panel, we can clearly see that the MSE of the latent approach scales linearly against the LR complexity (19) as predicted by Theorem 3. The series with △ (λ = 3.79 for 50 × 50 × 20, λ = 5.46 for 80 × 80 × 40) is mostly below the other series, which means that the optimal choice of the regularization constant is independent of the rank of the true tensor and only depends on the size; this agrees with the condition on λ in Theorem 3. Since the blue series and red series with the same markers lie on top of each other (especially the series with △, for which the optimal regularization constant is chosen), we can see that our theory correctly predicts not only the scaling against the latent ranks but also that against the size of the tensor. Note that the regularization constants are scaled by roughly 1.6 to account for the difference in the dimensionality.
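The factor of roughly 1.6 can be recovered from the size-dependent part of the regularization constant in Theorem 3. The sketch below evaluates $\sqrt{N/n_K} + \sqrt{n_1} + \sqrt{\log K}$ for the two tensor sizes; it leaves out the unknown constant $c_0$, the noise level σ, and the term involving α, so only ratios are meaningful. The function name is our own.

```python
import math

def lambda_noise_scale(shape):
    """sqrt(N / n_K) + sqrt(n_1) + sqrt(log K); n_1 is the largest and n_K the smallest dimension."""
    N = math.prod(shape)
    K = len(shape)
    return math.sqrt(N / min(shape)) + math.sqrt(max(shape)) + math.sqrt(math.log(K))

small = lambda_noise_scale((50, 50, 20))
large = lambda_noise_scale((80, 80, 40))
# Ratio of the leading terms sqrt(N / n_K) alone, and of the full expressions.
leading_ratio = math.sqrt(80 * 80 * 40 / 40) / math.sqrt(50 * 50 * 20 / 20)
print(leading_ratio, large / small)
```

The leading terms are 80 and 50, which gives the ratio 1.6; including the two lower-order terms lowers the ratio slightly.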

The right panel reveals that in many cases the latent approach performs better than the overlapped approach, i.e., MSE (overlapped) / MSE (latent) is greater than one. Moreover, we can see that the success of the latent approach relative to the overlapped approach is correlated with a high ratio of TR complexity to LR complexity. Indeed, we found that the optimal decomposition of the true tensor $\mathcal{W}^*$ was typically a singleton decomposition corresponding to the smallest Tucker rank (see Section 3.2).

One might think that we can fix the overlapped approach by allowing an individual regularization constant for each mode. However, this would only be possible if we knew the mode with small rank.

The improvements here are milder than those in Figure 2. This is because most of the randomly generated low-rank tensors were simultaneously low-rank to some degree. It is interesting that the latent approach performs at least as well as the overlapped approach in such situations too.

² http://www.ibis.t.u-tokyo.ac.jp/RyotaTomioka/Softwares/Tensor

Figure 3. Performance of the overlapped approach and the latent approach for tensor decomposition, shown against their theoretically predicted complexity measures (see Eqs. (18) and (19)). The right panel shows the improvement of the latent approach over the overlapped approach against the ratio of their complexity measures.



5. Conclusion 

In this paper, we have presented a framework for structured Schatten norms. The current framework includes both the overlapped Schatten 1-norm and the latent Schatten 1-norm recently proposed in the context of convex-optimization-based tensor decomposition (Signoretto et al., 2010; Gandy et al., 2011; Liu et al., 2009; Tomioka et al., 2011a), and connects these studies to the broader studies on structured sparsity (Bach et al., 2011; Jenatton et al., 2011; Obozinski et al., 2011; Maurer & Pontil, 2011). Moreover, we have shown a duality that holds between the two types of norms.

Furthermore, we have rigorously studied the perfor- 
mance of the latent approach for tensor decomposi- 
tion. We have shown the consistency of the latent 
Schatten 1-norm minimization. Next, we have ana- 
lyzed the denoising performance of the latent approach 
and shown that the error of the latent approach is up- 
per bounded by the minimum Tucker rank, which con- 
trasts sharply against the average (square root) depen- 
dency of the overlapped approach analyzed in Tomioka 
et al. (2011b). This explains the empirically observed 
superior performance of the latent approach compared 
to the overlapped approach. The most difficult case for 
the overlapped approach is when the unknown tensor 
is only low-rank in one mode as in Figure 2. 

We have also confirmed through numerical simulations that our analysis precisely predicts the scaling of the mean squared error as a function of the dimensionalities and the latent rank of the unknown tensor. Unlike the Tucker rank, the latent rank of a tensor is not easy to compute. However, note that the theoretically optimal scaling of the regularization constant does not depend on the latent rank.

Therefore we have theoretically and empirically shown that for noisy tensor decomposition, the latent approach is more likely to perform better than the overlapped approach. Analyzing the performance of the latent approach for tensor completion is important future work.

The structured Schatten norms proposed in this paper include norms for tensors that are not yet employed in practice. Therefore, we envision that this paper will serve as a starting point for various extensions, e.g., using the overlapped $S_1/\infty$-norm instead of the $S_1/1$-norm, or a non-sparse tensor decomposition similar to the $\ell_p$-norm MKL (Micchelli & Pontil, 2005; Kloft et al., 2011).

Supplementary material for "Convex Tensor Decomposition via Structured Schatten Norms"

A. Proof of Lemma 1 

Proof. From the definition, the dual norm \\\X\\ 



(S P /q)' 



can be written as follows: 



|W||| (w =sup<W,*) B.t. 



WL . < 1. 

I \\\Sp/q — 



The basic strategy of the proof is to rewrite the above
maximization problem as a constrained optimization
problem and derive the dual problem.

First, we rewrite the above maximization problem as
follows:

$$|||\mathcal{X}|||_{(\underline{S_p/q})^*} = \sup \; \frac{1}{K} \sum_{k=1}^{K} \langle Z_k, X_{(k)} \rangle \quad \text{s.t.} \quad Z_k = W_{(k)}, \quad \sum_{k=1}^{K} \|Z_k\|_{S_p}^{q} \le 1,$$

where $Z_k \in \mathbb{R}^{n_k \times N/n_k}$ ($k = 1, \ldots, K$) are auxiliary
variables.
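The rewriting above rests on the fact that every mode-$k$ unfolding preserves the tensor inner product, so $\langle \mathcal{W}, \mathcal{X} \rangle$ equals the average of the $K$ matrix inner products. The following NumPy sketch illustrates this. The helper names (`unfold`, `overlapped_schatten`, `averaged_inner_product`) are ours and not from the paper, and the overlapped norm is written without a $1/K$ factor, which is an assumption read off from the constraint as displayed.

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding of T into an n_k x (N / n_k) matrix."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def overlapped_schatten(W, p, q):
    """(sum_k ||W_(k)||_{S_p}^q)^(1/q), assuming no 1/K normalization."""
    norms = []
    for k in range(W.ndim):
        s = np.linalg.svd(unfold(W, k), compute_uv=False)
        norms.append((s ** p).sum() ** (1.0 / p))
    return float(np.sum(np.array(norms) ** q) ** (1.0 / q))

def averaged_inner_product(W, X):
    """(1/K) sum_k <W_(k), X_(k)>; each summand equals the tensor inner product."""
    K = W.ndim
    total = sum(np.sum(unfold(W, k) * unfold(X, k)) for k in range(K))
    return float(total / K)
```

With $p = q = 2$ every unfolding has the same Frobenius norm, so the overlapped norm reduces to $\sqrt{K}$ times the Frobenius norm of the tensor, which gives a quick consistency check.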

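Next we write down the Lagrangian as follows:

$$L = \frac{1}{K} \sum_{k=1}^{K} \langle Z_k, X_{(k)} \rangle + \frac{1}{K} \sum_{k=1}^{K} \langle Y^{(k)}_{(k)}, Z_k - W_{(k)} \rangle - \frac{\gamma}{Kq} \Bigl( \sum_{k=1}^{K} \|Z_k\|_{S_p}^{q} - 1 \Bigr),$$

where $\mathcal{Y}^{(k)} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ ($k = 1, \ldots, K$), and $\gamma \ge 0$
are Lagrangian multipliers.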

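Note that for $X, Z \in \mathbb{R}^{R \times C}$, we have

$$\langle X, Z \rangle - \frac{\gamma}{q} \|Z\|_{S_p}^{q} \le \gamma \sup_{Z} \Bigl( \|X/\gamma\|_{S_{p^*}} \|Z\|_{S_p} - \frac{1}{q} \|Z\|_{S_p}^{q} \Bigr) \le \frac{\gamma^{1-q^*}}{q^*} \|X\|_{S_{p^*}}^{q^*}.$$

Here the first equality is achieved if we take
$Z = c\,U \mathrm{diag}(\sigma_1^{p^*/p}, \ldots, \sigma_r^{p^*/p}) V^\top$, where
$U \mathrm{diag}(\sigma_1, \ldots, \sigma_r) V^\top$ is the singular value
decomposition of the matrix $X/\gamma$, and $c$ is an arbitrary
scaling constant. The second equality is achieved if
we take $\|Z\|_{S_p} = \|X/\gamma\|_{S_{p^*}}^{q^*/q}$.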
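As a numerical sanity check of this matrix inequality and of the two equality conditions, one can compare both sides for random $Z$ and for the particular $Z$ described in the text. The sketch below assumes $1 < p, q < \infty$; all function names are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def schatten(M, p):
    """Schatten p-norm: the l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return float((s ** p).sum() ** (1.0 / p))

def conjugate(p):
    """Conjugate exponent p* with 1/p + 1/p* = 1 (for 1 < p < infinity)."""
    return p / (p - 1.0)

def objective(X, Z, gamma, p, q):
    """Left-hand side: <X, Z> - (gamma / q) * ||Z||_{S_p}^q."""
    return float(np.sum(X * Z) - gamma / q * schatten(Z, p) ** q)

def upper_bound(X, gamma, p, q):
    """Right-hand side: gamma^(1 - q*) / q* * ||X||_{S_p*}^(q*)."""
    ps, qs = conjugate(p), conjugate(q)
    return float(gamma ** (1.0 - qs) / qs * schatten(X, ps) ** qs)

def tight_Z(X, gamma, p, q):
    """The Z from the text: aligned with X / gamma, then rescaled."""
    ps, qs = conjugate(p), conjugate(q)
    U, s, Vt = np.linalg.svd(X / gamma, full_matrices=False)
    Z = (U * s ** (ps / p)) @ Vt                  # singular values sigma^(p*/p)
    target = schatten(X / gamma, ps) ** (qs / q)  # required Schatten p-norm of Z
    return Z * (target / schatten(Z, p))
```

For any random `Z` the value of `objective` should stay below `upper_bound`, while `tight_Z` should attain it up to rounding error.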



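Thus, maximizing the Lagrangian with respect to $Z_k$
($k = 1, \ldots, K$) and $\mathcal{W}$, we obtain the dual problem

$$|||\mathcal{X}|||_{(\underline{S_p/q})^*} = \inf_{\gamma,\, \mathcal{Y}^{(1)}, \ldots, \mathcal{Y}^{(K)}} \Bigl( \frac{\gamma^{1-q^*}}{K^{1-q^*} q^*} \sum_{k=1}^{K} \|Y^{(k)}_{(k)}\|_{S_{p^*}}^{q^*} + \frac{\gamma}{Kq} \Bigr) \quad \text{s.t.} \quad \mathcal{Y}^{(1)} + \cdots + \mathcal{Y}^{(K)} = \mathcal{X},$$

where we used the change of variable $(\mathcal{X} + \mathcal{Y}^{(k)})/K =: \mathcal{Y}^{(k)}$.
Furthermore, by explicitly minimizing over $\gamma$,
we have $\gamma/K = \bigl( \sum_{k=1}^{K} \|Y^{(k)}_{(k)}\|_{S_{p^*}}^{q^*} \bigr)^{1/q^*}$ and we obtain

the statement of the lemma. □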
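The last step, the minimization over $\gamma$, is a one-dimensional convex problem. Writing $g = \gamma/K$ and $A = \sum_k \|Y^{(k)}_{(k)}\|_{S_{p^*}}^{q^*}$, the dual objective becomes $g^{1-q^*} A / q^* + g/q$. The sketch below (with hypothetical function names, and assuming $1 < q < \infty$) checks that the minimizer is $g = A^{1/q^*}$ and that the minimum value is again $A^{1/q^*}$, i.e., the latent norm of the decomposition.

```python
import numpy as np

def dual_objective(g, A, q):
    """g^(1 - q*) * A / q* + g / q, where g = gamma / K and A = sum_k ||Y^(k)_(k)||^(q*)."""
    qs = q / (q - 1.0)
    return g ** (1.0 - qs) * A / qs + g / q

def closed_form_minimizer(A, q):
    """Stationary point g = A^(1/q*) of the convex objective above."""
    qs = q / (q - 1.0)
    return A ** (1.0 / qs)
```

Evaluating `dual_objective` on a grid of positive `g` values and comparing against the value at `closed_form_minimizer(A, q)` confirms the closed form numerically.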

B. Proof of Theorem 2 

Let $\hat{\mathcal{W}} = \sum_{k=1}^{K} \hat{\mathcal{W}}^{(k)}$ be the solution and its optimal
decomposition of the minimization problem (11); in
addition let $\Delta^{(k)} := \hat{\mathcal{W}}^{(k)} - \mathcal{W}^{*(k)}$.

The proof is based on Lemmas 3 and 4, which we 
present below. 



In order to present the first lemma, we need the following
definitions. Let $U_k S_k V_k^\top = W^{*(k)}_{(k)}$ be the singular
value decomposition of the mode-$k$ unfolding of the
$k$th component of the unknown tensor $\mathcal{W}^*$. We define
the orthogonal projection of $\Delta^{(k)}_{(k)}$ as follows:

$$\Delta^{(k)}_{(k)} = \Delta'_k + \Delta''_k,$$

where

$$\Delta''_k = (I_{n_k} - U_k U_k^\top)\, \Delta^{(k)}_{(k)}\, (I_{N/n_k} - V_k V_k^\top).$$

Intuitively speaking, $\Delta''_k$ lies in a subspace completely
orthogonal to the unfolding of the $k$th component
$W^{*(k)}_{(k)}$, whereas $\Delta'_k$ lies in a partially correlated subspace.



The following lemma is similar to Negahban et al.
(2009, Lemma 1) and Tomioka et al. (2011b, Lemma
2), and it bounds the Schatten 1-norm of the orthogonal
part $\Delta''_k$ with that of the partially correlated part
$\Delta'_k$ and also bounds the rank of $\Delta'_k$.

Lemma 3. Let $\hat{\mathcal{W}}$ be the solution of the minimization
problem (11) with the regularization constant
$\lambda \ge 2 |||\mathcal{E}|||_{\underline{S_\infty/\infty}}$. Let $\Delta^{(k)}$ and its decomposition be as
defined above. Then we have

1. $\mathrm{rank}(\Delta'_k) \le 2 \bar{r}_k$,

2. $\sum_{k=1}^{K} \|\Delta''_k\|_{S_1} \le 3 \sum_{k=1}^{K} \|\Delta'_k\|_{S_1}$.
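The decomposition $\Delta^{(k)}_{(k)} = \Delta'_k + \Delta''_k$ and the rank bound in the first item of Lemma 3 are easy to reproduce for a single matrix. The following NumPy sketch uses a hypothetical helper `split_delta`; the argument `W_star` stands for the unfolding $W^{*(k)}_{(k)}$ and `r` for its rank, both chosen freely for illustration.

```python
import numpy as np

def split_delta(Delta, W_star, r):
    """Return (Delta', Delta'') relative to the rank-r SVD of W_star.

    Delta'' = (I - U U^T) Delta (I - V V^T) is orthogonal to the row and
    column spaces of W_star; Delta' = Delta - Delta'' is the remainder.
    """
    U, _, Vt = np.linalg.svd(W_star, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    P_left = np.eye(U.shape[0]) - U @ U.T
    P_right = np.eye(V.shape[0]) - V @ V.T
    D2 = P_left @ Delta @ P_right
    D1 = Delta - D2
    return D1, D2
```

Since $\Delta'_k = U_k U_k^\top \Delta + (I - U_k U_k^\top) \Delta V_k V_k^\top$ is a sum of two matrices of rank at most $r$, its rank is at most $2r$, and the two parts are orthogonal with respect to the trace inner product. This is also what makes the bound $\|\Delta'_k\|_{S_1} \le \sqrt{2r}\, \|\Delta'_k\|_F$ available.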



Note that although the proof of the above statement
closely follows that of Tomioka et al. (2011b, Lemma
2), the notion of rank is different. In their result, the
rank is the Tucker rank $r_k$, whereas the rank here is
the mode-$k$ rank of the $k$th component $\mathcal{W}^{*(k)}$ of the
truth.



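The following lemma relates the squared Frobenius
norm of the difference of the sums $|||\sum_{k=1}^{K} \Delta^{(k)}|||_F^2$ with
the sum of squared differences $\sum_{k=1}^{K} |||\Delta^{(k)}|||_F^2$.

Lemma 4. Let $\hat{\mathcal{W}}$ be the solution of the minimization
problem (11). Then we have,

$$\frac{1}{2} \sum_{k=1}^{K} |||\Delta^{(k)}|||_F^2 \le \frac{1}{2} |||\Delta|||_F^2 + \alpha (K-1) \sum_{k=1}^{K} \|\Delta^{(k)}_{(k)}\|_{S_1},$$

where $\Delta = \sum_{k=1}^{K} \Delta^{(k)}$.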

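Proof of Theorem 2. First from the optimality of $\hat{\mathcal{W}}$,
we have

$$\frac{1}{2} |||\mathcal{Y} - \hat{\mathcal{W}}|||_F^2 + \lambda \sum_{k=1}^{K} \|\hat{W}^{(k)}_{(k)}\|_{S_1} \le \frac{1}{2} |||\mathcal{Y} - \mathcal{W}^*|||_F^2 + \lambda \sum_{k=1}^{K} \|W^{*(k)}_{(k)}\|_{S_1},$$

which implies

$$\begin{aligned}
\frac{1}{2} |||\Delta|||_F^2 &\le \langle \Delta, \mathcal{E} \rangle + \lambda \sum_{k=1}^{K} \|\Delta^{(k)}_{(k)}\|_{S_1} \\
&\le \bigl( |||\mathcal{E}|||_{\underline{S_\infty/\infty}} + \lambda \bigr) \sum_{k=1}^{K} \|\Delta^{(k)}_{(k)}\|_{S_1}, \qquad (20)
\end{aligned}$$

where we used the fact that $\mathcal{Y} = \mathcal{W}^* + \mathcal{E}$ and the
triangular inequality in the first line, and Hölder's
inequality in the second line. Note that there is an
additional looseness in the second line due to the fact that
$\Delta = \sum_{k=1}^{K} \Delta^{(k)}$ is not the optimal decomposition of
$\Delta$ induced by the latent Schatten 1-norm.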
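The Hölder step used here bounds $\langle \Delta, \mathcal{E} \rangle$ by the largest spectral norm among the unfoldings of $\mathcal{E}$ times the sum of the Schatten 1-norms of the components, for any (not necessarily optimal) decomposition of $\Delta$. A minimal NumPy sketch of this inequality follows; `unfold` and `holder_sides` are illustrative names, not part of the paper.

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding of T into an n_k x (N / n_k) matrix."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def holder_sides(components, E):
    """Return (<Delta, E>, max_k ||E_(k)||_{S_inf} * sum_k ||Delta^(k)_(k)||_{S_1}).

    `components` is a list of K tensors Delta^(1), ..., Delta^(K) whose sum is Delta;
    the k-th component is unfolded along mode k.
    """
    Delta = sum(components)
    lhs = float(np.sum(Delta * E))
    dual = max(np.linalg.norm(unfold(E, k), 2) for k in range(E.ndim))
    primal = sum(np.linalg.norm(unfold(D, k), 'nuc')
                 for k, D in enumerate(components))
    return lhs, float(dual * primal)
```

The gap between the two returned values is typically large for a random decomposition, which reflects the looseness mentioned above.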

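Next, combining inequality (20) with Lemma 4, we
have

$$\frac{1}{2} \sum_{k=1}^{K} |||\Delta^{(k)}|||_F^2 \le 2 \lambda \sum_{k=1}^{K} \|\Delta^{(k)}_{(k)}\|_{S_1}, \qquad (21)$$

where we used the fact that $\lambda \ge |||\mathcal{E}|||_{\underline{S_\infty/\infty}} + \alpha (K-1)$.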



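Finally combining inequality (21) with Lemma 3, we
obtain

$$\begin{aligned}
\frac{1}{2} \sum_{k=1}^{K} |||\Delta^{(k)}|||_F^2 &\le 2 \lambda \sum_{k=1}^{K} \bigl( \|\Delta'_k\|_{S_1} + \|\Delta''_k\|_{S_1} \bigr) \\
&\le 8 \lambda \sum_{k=1}^{K} \|\Delta'_k\|_{S_1} \\
&\le 8 \lambda \sum_{k=1}^{K} \sqrt{2 \bar{r}_k}\, \|\Delta'_k\|_F \\
&\le 8 \lambda \sum_{k=1}^{K} \sqrt{2 \bar{r}_k}\, |||\Delta^{(k)}|||_F \\
&\le 8 \lambda \sqrt{2 \sum_{k=1}^{K} \bar{r}_k}\, \sqrt{\sum_{k=1}^{K} |||\Delta^{(k)}|||_F^2},
\end{aligned}$$

where we used Lemma 3 in the second line, Hölder's
inequality in the third line (combined with Lemma 3),
the fact that $\Delta^{(k)}_{(k)} = \Delta'_k + \Delta''_k$ is an orthogonal
decomposition in the fourth line, and Cauchy-Schwarz
inequality in the fifth line. Dividing both sides of
the last inequality by $\sqrt{\sum_{k=1}^{K} |||\Delta^{(k)}|||_F^2}$, we obtain our
claim. □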



C. Proof of Theorem 3 

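Proof. Since each entry of $\mathcal{E}$ is an independent zero-mean
Gaussian random variable with variance $\sigma^2$, for
each mode $k$ we have the following tail bound (Corollary
5.35 in (Vershynin, 2010)):

$$P\Bigl( \|E_{(k)}\|_{S_\infty} > \sigma \bigl( \sqrt{N/n_k} + \sqrt{n_k} \bigr) + t \Bigr) \le \exp\bigl( -t^2/(2\sigma^2) \bigr).$$

Next, taking a union bound, we have

$$P\Bigl( \max_k \|E_{(k)}\|_{S_\infty} > \sigma \max_k \bigl( \sqrt{N/n_k} + \sqrt{n_k} \bigr) + t \Bigr) \le K \exp\bigl( -t^2/(2\sigma^2) \bigr).$$

Substituting $t \leftarrow t + \sigma \sqrt{\log K}$, we have

$$\begin{aligned}
P\Bigl( |||\mathcal{E}|||_{\underline{S_\infty/\infty}} > \sigma \max_k \bigl( \sqrt{N/n_k} + \sqrt{n_k} \bigr) + \sigma \sqrt{\log K} + t \Bigr)
&\le \exp\Bigl( -\frac{t^2 + 2 \sigma \sqrt{\log K}\, t}{2 \sigma^2} \Bigr) \\
&\le \exp\bigl( -t^2/(2\sigma^2) \bigr).
\end{aligned}$$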

Therefore if $c \ge 2$,

$$\lambda = c \sigma \bigl( \sqrt{N/n_K} + \sqrt{n_1} + \sqrt{\log K} \bigr) + \alpha (K-1) \ge 2 |||\mathcal{E}|||_{\underline{S_\infty/\infty}} + \alpha (K-1)$$

with probability at least $1 - \exp\bigl( -\frac{(c-2)^2}{8} (N/n_K) \bigr)$,
which satisfies the condition of Theorem 2. Substituting
the above $\lambda$ into the right hand side of the error
bound in Theorem 2 we have the statement of Theorem 3. □
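The scale $\sigma \max_k (\sqrt{N/n_k} + \sqrt{n_k})$ that drives the choice of $\lambda$ can be compared against a simulated noise tensor. The sketch below is only an illustration under the Gaussian noise assumption of the theorem; the helper names and the example dimensions are ours.

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding of T into an n_k x (N / n_k) matrix."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def max_unfolding_spectral_norm(E):
    """max_k ||E_(k)||_{S_inf}: the largest spectral norm over all unfoldings."""
    return float(max(np.linalg.norm(unfold(E, k), 2) for k in range(E.ndim)))

def noise_scale(shape, sigma):
    """sigma * max_k (sqrt(N / n_k) + sqrt(n_k)) for a tensor of the given shape."""
    N = float(np.prod(shape))
    return float(sigma * max(np.sqrt(N / n) + np.sqrt(n) for n in shape))

# Example usage: one draw of i.i.d. Gaussian noise with standard deviation sigma.
rng = np.random.default_rng(0)
shape, sigma = (10, 12, 15), 0.5
E = sigma * rng.standard_normal(shape)
observed = max_unfolding_spectral_norm(E)
predicted = noise_scale(shape, sigma)
```

In such simulations `observed` is typically close to, and rarely much larger than, `predicted`, in line with the exponential tail bound above.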

D. Proof of Theorem 4 

Proof. We first prove the "if" direction. Suppose that
there is another decomposition

$$\sum_{k=1}^{K} \tilde{\mathcal{W}}^{(k)} = \sum_{k=1}^{K} \mathcal{W}^{(k)}$$

such that $\mathrm{rank}(\tilde{W}^{(k)}_{(k)}) \le \mathrm{rank}(W^{(k)}_{(k)})$.

Note that $\tilde{\mathcal{W}}^{(k)} \ne \mathcal{W}^{(k)}$ can happen only when $\mathcal{W}^{(k)} \ne 0$ (otherwise
the rank would increase). Also note that $\tilde{\mathcal{W}}^{(k)} \ne \mathcal{W}^{(k)}$
should happen for at least two $k$'s. Combining these
we conclude that there are $k \ne \ell$ such that $\mathcal{W}^{(k)} \ne 0$
and $\mathcal{W}^{(\ell)} \ne 0$.

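Conversely, suppose that there are $k \ne \ell$ such that
$\mathcal{W}^{(k)} \ne 0$ and $\mathcal{W}^{(\ell)} \ne 0$. We can write³

$$\mathcal{W}^{(k)} = \mathcal{C}^{(k)} \times_k U_k, \qquad \mathcal{W}^{(\ell)} = \mathcal{C}^{(\ell)} \times_\ell U_\ell,$$

where $U_k \in \mathbb{R}^{\bar{r}_k \times n_k}$, $\mathcal{C}^{(k)} \in \mathbb{R}^{n_1 \times \cdots \times \bar{r}_k \times \cdots \times n_K}$,
and $U_\ell$ and $\mathcal{C}^{(\ell)}$ are defined similarly. Since $\mathcal{C}^{(k)}$ and
$\mathcal{C}^{(\ell)}$ are allowed to be full rank, we can define

$$\tilde{\mathcal{W}}^{(k)} = \bigl( \mathcal{C}^{(k)} + \mathcal{V} \times_\ell U_\ell \bigr) \times_k U_k, \qquad \tilde{\mathcal{W}}^{(\ell)} = \bigl( \mathcal{C}^{(\ell)} - \mathcal{V} \times_k U_k \bigr) \times_\ell U_\ell,$$

for any $\mathcal{V} \in \mathbb{R}^{n_1 \times \cdots \times \bar{r}_k \times \cdots \times \bar{r}_\ell \times \cdots \times n_K}$. Then we have

$$\begin{aligned}
\mathcal{W}^{(k)} + \mathcal{W}^{(\ell)} &= \mathcal{C}^{(k)} \times_k U_k + \mathcal{C}^{(\ell)} \times_\ell U_\ell \\
&= \bigl( \mathcal{C}^{(k)} + \mathcal{V} \times_\ell U_\ell \bigr) \times_k U_k + \bigl( \mathcal{C}^{(\ell)} - \mathcal{V} \times_k U_k \bigr) \times_\ell U_\ell \\
&= \tilde{\mathcal{W}}^{(k)} + \tilde{\mathcal{W}}^{(\ell)}.
\end{aligned}$$

Note that $\mathrm{rank}(\tilde{W}^{(k')}_{(k')}) = \bar{r}_{k'}$ for $k' = k, \ell$. Therefore,
there are infinitely many decompositions that have the
same rank $(\bar{r}_1, \ldots, \bar{r}_K)$.

□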
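The construction in the converse direction can be reproduced numerically: two non-zero components are perturbed by an arbitrary core tensor without changing their sum or their mode-wise ranks. The sketch below follows the convention of the footnote for the mode-$k$ product (the matrix has shape $d_k \times n_k$); the function names and the concrete sizes in the usage example are assumptions made for illustration.

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding of T into an n_k x (N / n_k) matrix."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def mode_product(B, C, k):
    """A = B x_k C with a[..., i_k, ...] = sum_j b[..., j, ...] * c[j, i_k]."""
    return np.moveaxis(np.tensordot(B, C, axes=(k, 0)), -1, k)

def perturb(C_k, U_k, C_l, U_l, V, k, l):
    """Alternative components with the same sum as (C_k x_k U_k, C_l x_l U_l)."""
    W_k_new = mode_product(C_k + mode_product(V, U_l, l), U_k, k)
    W_l_new = mode_product(C_l - mode_product(V, U_k, k), U_l, l)
    return W_k_new, W_l_new

# Example usage: a 4 x 5 x 6 tensor, modes k = 0 and l = 1, ranks 2 and 3.
rng = np.random.default_rng(3)
U_k, C_k = rng.standard_normal((2, 4)), rng.standard_normal((2, 5, 6))
U_l, C_l = rng.standard_normal((3, 5)), rng.standard_normal((4, 3, 6))
V = rng.standard_normal((2, 3, 6))
W_k, W_l = mode_product(C_k, U_k, 0), mode_product(C_l, U_l, 1)
W_k_new, W_l_new = perturb(C_k, U_k, C_l, U_l, V, 0, 1)
```

The sum is preserved because mode products along different modes commute, so the two terms involving `V` cancel exactly.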


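³ Here the tensor mode-$k$ product $\mathcal{A} = \mathcal{B} \times_k C$ is defined
as $a_{i_1 \cdots i_K} = \sum_{j=1}^{d_k} b_{i_1 \cdots j \cdots i_K}\, c_{j i_k}$, where $\mathcal{A} = (a_{i_1 \cdots i_K}) \in \mathbb{R}^{n_1 \times \cdots \times n_K}$,
$\mathcal{B} = (b_{i_1 \cdots j \cdots i_K}) \in \mathbb{R}^{n_1 \times \cdots \times d_k \times \cdots \times n_K}$, and
$C = (c_{j i_k}) \in \mathbb{R}^{d_k \times n_k}$.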



